Method and apparatus for determining emotional arousal by speech analysis

ABSTRACT

An apparatus for determining emotional arousal of a subject by speech analysis, and an associated method. In the method, a speech sample is obtained, the speech sample is pre-processed into silent and active speech segments and the active speech segments are divided into strings of equal length blocks (the blocks having primary speech parameters including pitch and amplitude parameters), a plurality of selected secondary speech parameters indicative of characteristics of equal-pitch, rising-pitch and falling-pitch trends in the strings of blocks are derived, the secondary speech parameters are compared with predefined, subject independent values representing non-emotional speech to generate a processing result indicative of emotional arousal, and the generated processed result is outputted to an output device.

FIELD OF THE INVENTION

The present invention relates to the field of voice and speech analysis and in particular to the analysis of acoustic and prosodic features of speech.

This application is a filing under 35 USC 371 of PCT/IL2002/00648 filed Aug. 7, 2002.

BACKGROUND OF THE INVENTION

It is long known that certain voice characteristics carry information regarding the emotional state of the speaker. As far back as 1934, Lynch noted differences in timing and pitch characteristics between factual and emotional speech. (Lynch, G. E. (1934). A Phonophotographic Study of Trained and Untrained Voices Reading Factual and Dramatic Material, Arch. Speech, 1, 9-25.)

Since then, many studies have demonstrated correlations between various non-verbal speech characteristics and specific emotional states, and research efforts have been directed to different aspects of the emotional speech phenomenon. One line of research focuses on identifying the carriers of emotion within the speech signal, and studies have shown complex correlation patterns between pitch (the fundamental voice tone, dependent on the number of vibrations of the vocal cords per second), amplitude, timing, duration, pace, envelope contours and other speech variables and the emotional state of the speaker. A second research area tries to explore the expression of different emotional dimensions in speech, and the studies suggest correlations between constituent elements of speech and dimensions characterizing the emotional state of the subject. A further research effort focuses on revealing the distinctive correlations between parts of speech and various emotional states including primary emotions, such as anger, secondary emotions, such as boredom, for example, and specific stressful situations, such as anxiety, workload and lying, for example. Yet another area of research tries to point out the differences in emotional speech patterns between different individuals, different groups of individuals, as categorized by sex, age, culture and personality type, for example, and even between the voice patterns corresponding to different physiological states of the same individuals.

Three extensive literature reviews, summarizing the various findings regarding the vocal expression of emotion, were published by Murray, I. R. and Arnott, J. L. (1993), Towards the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion, Journal of the Acoustical Society of America, vol. 93 (2), 1097-1108, by Frick, R. W. (1985), Communicating Emotion: The Role of Prosodic Features, Psychological Bulletin, 97, 412-429, and by Scherer, K. R. (1986), Vocal Affect Expression: A Review and a Model for Future Research, Psychological Bulletin, 99, 143-165. All these writers emphasize the fragmented nature of the research in this field, and point out that the vocal emotion research forms only a very small and isolated part of the general emotion literature and the general speech analysis literature. These reviews support the notion that human voice characteristics vary in relation to expression of emotion; yet, they highlight the complexity of the interplay between physiology, psychology and speech regarding emotions. They also stress the need for generalized models for a more coherent understanding of the phenomena.

In recent years, a few studies have approached the task of automatic classification of vocal expression of different emotional states by utilizing statistical pattern recognition models. Relative success has been achieved; see Dellaert, F., Polzin, T. S. and Waibel, A. (1996), Recognizing emotions in speech, in Proc. ICSLP, Philadelphia Pa., USA, 1996, and Amir, N. and Ron, S. (1998), Towards an automatic classification of emotions in speech, in Proc. ICSLP, Sydney, 1998, for example.

The field of emotion in speech is attracting increasing interest, and a special workshop dedicated to this topic was held in Belfast in September 2001 (ISCA workshop on Speech and Emotion; presented papers: http://www.qub.ac.uk/en/isca/proceedings/index.html). The papers, theoretical and empirical, reveal once more the complexity of the phenomenon, the lack of data and the various aspects that are involved.

In respect to the detection of emotion through speech analysis, the literature highlights several problems, yet to be resolved. We would like to emphasize two of the major problems:

The first problem is the lack of a unified model of emotional acoustic correlates, enabling the different emotional content in speech to be addressed by one general indicator; the current state of the research only enables the pointing out of isolated acoustic correlations with specific emotional states.

The second problem is the difficulty in overcoming the different speech expression patterns of different speakers, which tend to mask the emotional differences. Prior research has tried to confront the latter problem by obtaining reference speech characteristics of the tested individual, or of specific groups of individuals. These references are either prior baseline (non-emotional) measurements of a specific subject, or the specific emotional speech profiles of relatively homogenous groups of subjects, such as all subjects suffering from depression, for example.

Several patents regarding this field have been registered over the years. These patents are mainly characterized as having the same limitations described above in regard to the academic research, namely, they focus on specific emotional states and depend on prior reference measurements. The patents also vary significantly in their measurement procedures and parameters.

Fuller, in three U.S. patents from 1974 (U.S. Pat. No. 3,855,416; U.S. Pat. No. 3,855,417 and U.S. Pat. No. 3,855,418), suggests a method for indicating stress in speech and for determining whether a subject is lying or telling the truth. The suggested method measures vibrato content (rapid modulation of the phonation) and the normalized peak amplitude of the speech signal, and is particularly directed to analyzing the speech of a subject under interrogation.

Bell et al., in 1976 (U.S. Pat. No. 3,971,034), also suggested a method for detecting psychological stress through speech. The method described is based mainly on the measurement of infrasonic modulation changes in the voice.

Williamson, in two patents from 1978 and 1979 (U.S. Pat. No. 4,093,821 and U.S. Pat. No. 4,142,067), describes a method for determining the emotional state of a person by analyzing frequency perturbations in the speech pattern. Analysis is based mainly on measurements of the first formant frequency of speech; however, the differences corresponding to the different emotional states are not specified clearly: in the first patent, the apparatus mainly indicates stress versus relaxation, whereas in the second patent, the user of the device should apply “visual integration and interpretation of the displayed output” for “making certain decisions with regard to the emotional state”.

Jones, in 1984 (U.S. Pat. No. 4,490,840), suggests a method for determining patterns of voice-style (resonance, quality), speech-style (variable-monotone, choppy-smooth, etc.) and perceptual-style (sensory-internal, hate-love, etc.), based on different voice characteristics, including six spectral peaks and pauses within the speech signal. However, the inventor states that “the presence of specific emotional content is not of interest to the invention disclosed herein.”

Silverman, in two U.S. patents from 1987 and 1992 (U.S. Pat. No. 4,675,904 and U.S. Pat. No. 5,148,483), suggests a method for detecting suicidal predisposition from a person's speech patterns, by identifying substantial decay on utterance conclusion and low amplitude modulation during the utterance.

Ron, in 1997 (U.S. Pat. No. 5,647,834), describes a speech-based biofeedback regulation system that enables a subject to monitor and to alter his emotional state. An emotional indication signal is extracted from the subject's speech (the method of measurement is not described in the patent) and compared to online physiological measurements of the subject that serve as a reference for his emotional condition. The subject can then try to alter the indication signal in order to gain control over his emotional state.

Bogdashevsky et al., in a U.S. patent from 1999 (U.S. Pat. No. 6,006,188), suggest a method for determining psychological or physiological characteristics of a subject based on the creation of specific prior knowledge bases for certain psychological and physiological states. The process described involves creation of homogenous groups of subjects by their psychological assessment (e.g. personality diagnostic groups according to common psychological inventories), analyzing their unique speech patterns (based on cepstral coefficients) and forming specific knowledge bases for these groups. Matching to certain psychological and physiological groups can be accomplished by comparing the speech patterns of an individual (who is asked to speak a 30-phrase text similar to the text used by the reference group) to the knowledge base characteristics of the group. The patent claims to enable verbal psychological diagnosis of relatively steady conditions, such as comparing mental status before and after therapy and personality profile, for example.

Petrushin, in 2000 (U.S. Pat. No. 6,151,571), describes a method for monitoring a conversation between a pair of speakers, detecting an emotion of at least one of the speakers, determining whether the emotion is one of three negative emotions (anger, sadness or fear) and then reporting the negative emotion to a third party. Regarding the emotion recognition process, the patent details the stages required for obtaining such results: First, conducting an experiment with the target subjects is recommended, in order “to determine which portions of a voice are most reliable as indicators of emotion”. It is suggested to use a set of the most reliable utterances of this experiment as “training and test data for pattern recognition algorithms run by a computer”. The second stage is the feature extraction for the emotional states based on the collected data. The patent suggests several possible feature extraction methods using a variety of speech features. The third stage is recognizing the emotions based on the extracted features. Two approaches are offered: neural networks and ensembles of classifiers. The previously collected sets of data (representing the emotions) can be used to train the algorithms to determine the emotions correctly. Exemplary apparatuses, as well as techniques to improve emotion detection, are presented.

Slaney, in a U.S. patent from 2001 (U.S. Pat. No. 6,173,260), describes an emotional speech classification system. The system described is based on an empirical procedure that extracts the best combination of speech features (different measures of pitch and spectral envelope shape) that characterizes a given set of speech utterances labeled in accordance with predefined classes of emotion. After the system has been “trained” on the given set of utterances, it can use the extracted features for further classification of other utterances into these emotional classes. The procedure does not present any general emotional indicator, however, and only assumes that different emotional features can be empirically extracted for different emotional situations.

Two published PCT applications by Liberman also relate to emotion in speech. Liberman, in 1999 (WO 99/31653), suggests a method for determining certain emotional states through speech, including emotional stress and lying related states, such as untruthfulness, confusion and uncertainty, psychological dissonance, sarcasm and exaggeration. The procedure is based on measuring speech intonation information, in particular, plateaus and thorns in the speech signal envelope, using previous utterances of the speaker as a baseline reference.

Liberman, in 2000 (WO 00/62270), describes an apparatus for monitoring unconscious emotional states of an individual from speech specimens provided over the telephone to a voice analyzer. The emotional indicators include a sub-conscious cognitive activity level, a sub-conscious emotional activity level, an anticipation level, an attention level, a “love report” and sexual arousal. The method used is based on frequency spectrum analysis of the speech, wherein the frequency spectrum is divided into four frequency regions, and it is claimed that a higher percentage of frequencies in one of the regions reflects dominance of one of the emotional states above. It is suggested that cognitive activity would be correlated with the lowest frequencies, attention/concentration with main spectrum frequencies, emotional activity with high frequencies, and anticipation level with the highest frequencies.

Most of the abovementioned patents (Fuller, Bell, Jones, Silverman and Liberman) identify specific emotional states such as stress, lying or a tendency to commit suicide, by correlating specific speech features to these emotional conditions. Two of the patents (Williamson, Ron) assume that the appropriate speech correlates of the emotional states are given as input and totally ignore the task of describing any general indicator of emotional speech features. Three of the patents (Bogdashevsky, Petrushin and Slaney) suggest procedures for the extraction of specific speech correlates by “learning” given emotional classes of speech utterances. Thus, none of the abovementioned patents suggests a generalized speech-based indicator of emotional arousal per se, one that describes the speech expression of the emotional response created by a wide range of different emotional states.

Furthermore, in order to overcome the differences between individuals, some of these patents (Fuller, Williamson) require a skilled expert to manually analyze the results. Other patents (Ron, Liberman) require a comparison of the subject's speech measurements to prior baseline measurements of the same individual, as reference. Other patents (Bogdashevsky, Petrushin and Slaney) require a prior learning process of the speech characteristics of specific groups of individuals or specific psychological phenomena, to be used as reference.

Thus none of the above reviewed patents in this crowded art suggests an emotional speech indicator that is robust, having validity beyond different emotions and beyond the differences between specific individuals and specific groups. It is to the providing of such a robust, general indicator of emotional arousal by speech analysis, one that is insensitive to the differences between subjects and to particular emotion types, but sensitive to emotional arousal per se, that the present invention is directed.

SUMMARY OF THE INVENTION

The present invention is directed to the provision of a general indicator of emotional arousal of a subject, by speech analysis, applicable to a wide range of different emotional states. This emotional speech indicator is valid beyond the speech pattern differences between specific individuals or specific groups of individuals, and does not require comparing a speech sample from a subject with a reference speech sample obtained earlier from the same subject.

There is provided according to the present invention, a method for determining emotional arousal of a subject by speech analysis, comprising the steps of: obtaining a speech sample; pre-processing the speech sample into silent and active speech segments and dividing the active speech segments into strings of equal length blocks; the blocks having primary speech parameters including pitch and amplitude parameters; deriving a plurality of selected secondary speech parameters indicative of characteristics of equal-pitch, rising-pitch and falling-pitch trends in the strings of blocks; comparing the secondary speech parameters with predefined, subject independent values representing non-emotional speech to generate a processing result indicative of emotional arousal, and outputting the generated processed result to an output device.

Preferably the step of deriving further includes deriving a plurality of selected secondary speech parameters indicative of pause and silence characteristics of the speech sample being analyzed, optionally including analyzing irregularity of pace and rhythm, pitch, and amplitude of the speech sample being analyzed.

Optionally the plurality of selected secondary speech parameters are selected from the list of: average pause length and/or pause frequency; average length of short silences and/or short silences frequency; average length of equal pitch segments and/or equal pitch segments frequency; rising pitch segments length average and/or rising pitch segments frequency and/or falling pitch segments length average and/or falling pitch segments frequency; and the average amplitude dispersion within equal pitch segments of speech.

Optionally the step of obtaining a sample of speech comprises the step of inputting a digitized voice file. Alternatively, the step of obtaining a sample of speech comprises the step of capturing speech specimens and sampling and digitizing the speech specimens in a voice sampling and digitizing unit to form a digitized voice file.

Optionally, the step of pre-processing includes: obtaining digitized voice samples, normalizing said voice samples, data filtering, noise-reduction, segmenting the voice samples into silence and speech segments, dividing the speech segments into blocks, and processing the blocks by auto-correlation, to calculate pitch and amplitude voice parameters per block.
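By way of illustration only, the following Python sketch shows one way such a pre-processing step could be realized. It is not the patent's implementation; the block length, silence threshold and pitch search range (60-400 Hz) are assumptions introduced for the example, and a real embodiment would also perform data filtering and noise reduction before the block analysis.

```python
import numpy as np

def preprocess(samples: np.ndarray, sr: int, block_ms: float = 30.0,
               silence_rms: float = 0.01, fmin: float = 60.0, fmax: float = 400.0):
    """Normalize a mono signal, cut it into equal-length blocks and estimate
    per-block amplitude (RMS) and pitch (autocorrelation peak)."""
    samples = samples.astype(float)
    if samples.size == 0:
        return []
    peak = float(np.max(np.abs(samples))) or 1.0
    samples = samples / peak                        # amplitude normalization
    n = int(sr * block_ms / 1000.0)                 # samples per block (assumed length)
    blocks = []
    for start in range(0, len(samples) - n + 1, n):
        b = samples[start:start + n]
        rms = float(np.sqrt(np.mean(b ** 2)))
        pitch = 0.0
        if rms >= silence_rms:                      # silent blocks keep pitch 0
            ac = np.correlate(b, b, mode="full")[n - 1:]
            lo, hi = int(sr / fmax), int(sr / fmin)
            lag = lo + int(np.argmax(ac[lo:hi]))
            pitch = sr / lag                        # crude autocorrelation pitch, in Hz
        blocks.append({"t": start / sr, "rms": rms, "pitch": pitch})
    return blocks
```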

In one embodiment, the method described hereinabove may be adapted for analyzing a speech signal including a plurality of interacting voices, in that it further comprises the additional steps of: separating the interacting voices into separate voice channels, obtaining digitized voice samples, performing samples normalization for each channel of interest, performing data filtering for each channel of interest, performing noise-reduction for each channel of interest, performing silence and speech segmentation and dividing the speech segments into blocks for each channel of interest, and auto-correlation processing to calculate pitch and amplitude voice parameters per block for each channel of interest.

Optionally, the step of deriving includes: marking a speech segment of a pre-defined length for processing; calculating pause-related parameters for said speech segment; calculating silence-related parameters for the speech segment; joining blocks into strings of blocks categorized as being strings of blocks having rising pitch trends, strings of blocks having falling pitch trends and strings of blocks having equal pitch trends; calculating pitch-related parameters within the speech segment, the pitch-related parameters being selected from the list of frequency and average lengths of strings of blocks characterized by having rising, falling or equal pitch trends, and average amplitude dispersion of strings of blocks having equal pitch; and classifying the speech segment into one of several categories of typical parameter range.
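As an illustration of this deriving step, the sketch below joins the per-block pitch values produced by the pre-processing sketch above into equal-, rising- and falling-pitch strings and computes a few of the secondary parameters named in this section. The relative tolerance used to call a pitch step "equal" is an assumed value, not one specified in the patent.

```python
import statistics

def pitch_trend_strings(blocks, tol=0.03):
    """Join consecutive voiced blocks into 'equal', 'rising' or 'falling'
    pitch strings; 'tol' is the relative tolerance for calling a step equal."""
    voiced = [b for b in blocks if b["pitch"] > 0]
    strings, current, label = [], [], None
    for prev, cur in zip(voiced, voiced[1:]):
        delta = (cur["pitch"] - prev["pitch"]) / prev["pitch"]
        step = "equal" if abs(delta) <= tol else ("rising" if delta > 0 else "falling")
        if step == label:
            current.append(cur)
        else:
            if current:
                strings.append((label, current))
            label, current = step, [prev, cur]
    if current:
        strings.append((label, current))
    return strings

def secondary_pitch_parameters(strings, block_sec, total_sec):
    """Average length and rate per second of each trend type, plus the
    amplitude dispersion (std of RMS) inside equal-pitch strings."""
    out = {}
    for kind in ("equal", "rising", "falling"):
        lengths = [len(s) * block_sec for lbl, s in strings if lbl == kind]
        out[kind + "_len_avg"] = statistics.mean(lengths) if lengths else 0.0
        out[kind + "_per_sec"] = len(lengths) / total_sec
    eq_rms = [b["rms"] for lbl, s in strings if lbl == "equal" for b in s]
    out["equal_amp_dispersion"] = statistics.pstdev(eq_rms) if len(eq_rms) > 1 else 0.0
    return out
```

Given the block length used in pre-processing and the total segment duration, the returned dictionary can be fed directly into the comparison step described below.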

Optionally, the step of comparing the secondary speech parameters with predefined, subject independent values representing non-emotional speech to generate a processing result indicative of emotional arousal includes comparing at least two secondary voice parameter categories with pre-defined values representing non-emotional speech, the categories being selected from the list of: average pause length and/or pause frequency; average length of short silences and/or short silences frequency; average length of equal pitch segments and/or equal pitch segments frequency; rising pitch segments length average and/or rising pitch segments frequency and/or falling pitch segments length average and/or falling pitch segments frequency; and the average amplitude dispersion within equal pitch segments of speech.

Optionally, the method further comprises calculating a reliability grade based on at least one factor selected from the list of: quality of voice segment; significance of emotional arousal decision, and consistency of specific segment results with results of previous speech segments.

Preferably, the quality of voice segment is determined based on noise level, size of sampled data, and quality of sampled data.

Preferably, the significance of emotional arousal decision is determined based on number of participating parameters and degree of deviation within each parameter.
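A minimal sketch of how such a reliability grade might be combined follows; the weights, the normalizing constants and the two-valued consistency term are assumptions made for the example, not values taken from the patent.

```python
def reliability_grade(noise_level, n_samples, n_deviating_params,
                      mean_deviation, agrees_with_history,
                      weights=(0.4, 0.4, 0.2)):
    """Combine segment quality, decision significance and consistency with
    earlier segments into a 0..1 grade. Weights and scaling are assumed."""
    # quality: lower noise and more data -> higher quality (24000 samples ~ 3 s at 8 kHz)
    quality = max(0.0, 1.0 - noise_level) * min(1.0, n_samples / 24000.0)
    # significance: more deviating parameters, deviating more strongly
    significance = min(1.0, n_deviating_params / 5.0) * min(1.0, mean_deviation)
    # consistency: penalize a result that contradicts recent segments
    consistency = 1.0 if agrees_with_history else 0.5
    wq, ws, wc = weights
    return wq * quality + ws * significance + wc * consistency
```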

Optionally, there is an additional step of pattern processing to detect emotional patterns that are revealed along a time axis.

In a second aspect, the present invention is directed to an apparatus for speech analysis comprising: a voice input unit, a pre-processing unit for pre-processing voice samples from the voice input unit, a main processing unit for processing said pre-processed voice samples and detecting emotional arousal therefrom; and a main indicators output unit for outputting an indication of emotional arousal.

Optionally, the voice input unit includes a voice capturing unit and a voice sampling and digitizing unit coupled to the voice capturing unit for sampling and digitizing captured voice input.

Optionally, the voice input unit includes at least one of: a microphone, an interface to an audio player, an interface to a wired, wireless or cellular telephone, an interface to the Internet or other network, an interface to a computer, an interface to an electronic personal organizer or to any other electronic equipment, or an interface to a toy.

Preferably, the voice sampling and digitizing unit is selected from a sound card or a DSP chip-based voice sampling and digitizing device.

Preferably, the main indicators output unit is selected from a local output device, a display, a speaker, a file, a storage unit or monitoring device; or an interface to a remote computer, to the Internet, to another network, to a wired, wireless or cellular telephone, to a computer game, to a toy, to an electronic personal organizer or to any other electronic output equipment.

Optionally, all the aforementioned units are installed on a small, mobile, DSP chip based unit. Alternatively, some of the units may be physically distanced from other units, and the apparatus may further comprise an interface for allowing data communication between the units.

The pre-processing and processing units may alternatively be incorporated within a software tool capable of integrating with an external source of digitized voice input and with an external output device.

By primary speech parameter, as used herein, absolute values of parameters such as pitch or intensity are meant. By secondary speech parameter, the variation in the absolute values of those parameters is meant. Thus secondary speech parameters are derived statistics that are generally less susceptible to cultural, age and gender differences, background interference, quality of the signal analyzed and other distorting factors, and the secondary speech parameters used for indicating emotional arousal in preferred embodiments of the present invention are selected as being particularly robust, having low sensitivities to the differences between individuals and to background interference.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be further understood and appreciated from the following detailed description taken in conjunction with the drawings in which:

FIG. 1 is a block diagram illustration of an apparatus constructed and operative in accordance with one embodiment of the present invention;

FIG. 2 is a flow chart of a pre-processing unit constructed and operative in accordance with one embodiment of the present invention; and

FIG. 3 is a flow chart of a main processing unit constructed and operative in accordance with one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a method and apparatus for detecting emotional arousal through speech analysis. The term ‘emotional speech’ is used herein in regard to a speech segment in which the speaker expresses himself in an emotional manner. Non-emotional speech refers to a speech segment in which the speaker does not express himself in an emotional manner. Past descriptions of experiences and feelings or future expectations for desired or undesired events may be considered emotional speech only if the actual described or desired feeling or event is currently expressed in an emotional manner. The literature and patents reviewed hereinabove support clearly the phenomenon that different emotional states, when expressed vocally, alter the speech characteristics of a subject, in comparison to the characteristics of non-emotional speech. However, providing a general indicator that can determine the emotional arousal level of a person through speech analysis is still a very difficult task, mainly because of three factors:

1. Different emotional states affect differently the speech characteristics of an individual.

2. Voice and speech characteristics vary significantly between individuals.

3. Different emotional intensity (of the same emotion) affects different elements of speech to different extents.

In order to overcome the effect of these factors, most existing research and patents follow two guidelines: they separate the measurement of different types of emotions, and they use prior samples to obtain a comparable reference baseline.

The present invention suggests an automatic, real time, speech analysis method for indicating the existence of a level of generalized emotional arousal of a subject at a given time, beyond specific emotion states and beyond specific differences between individuals, without using a reference speech baseline specific to the subject himself.

Eliminating the need for a specific reference baseline, the generalization of emotional arousal voice characteristics beyond specific emotional states, and the emotional detection method based on pitch trends within the speech segment are three novel features of the present invention.

1. Emotional Arousal Beyond Specific Emotional States

A central assumption underlying the present invention is that non-emotional speech reflects an equilibrium state, and that emotional speech reflects a deviation from this balance. Emotional arousal is known to be a deviation from a physiological equilibrium in certain emotional states such as stress, for example. It is expressed in changes in autonomic system variables, such as heartbeat rate, muscle activity, galvanic skin resistance, blood pressure and blood temperature. In a corresponding manner, it is proposed that the changes in the speech patterns during emotional arousal may reflect a deviation from the balanced, ordered non-emotional state, and the present invention is based on the principle that the speech characteristics during emotional arousal are less systematic and more disordered than the characteristics of non-emotional speech. The violation of the ordered speech rhythm corresponding to extreme emotional arousal or excitement, such as crying or shouting, for example, is clear to most listeners. There are similar, corresponding changes in the ordered speech patterns that express minor excitement levels as well.

Although different emotional states may produce different speech characteristics, it is suggested that a common factor of speech characteristics in many different, emotionally aroused states lies in the irregularity of the speech patterns when compared with the more systematic nature of non-emotional speech. Similarly, although different individuals who are emotionally aroused, or excited, may have different speech characteristics, it is nevertheless suggested that common to nearly all such emotionally aroused individuals are less ordered speech patterns as compared to their general, non-emotionally aroused speech patterns. The present invention focuses on the measurement of this common factor, as an indicator highlighting the individual's general emotional arousal.

As reported in the literature, the expression of different emotional states has been found to correlate with specific speech characteristics. In contradistinction, we propose herein that two types of variables tend to characterize ‘emotional arousal’ itself, rather than specific emotional states. The first variable, referred to herein as constant-pitch presence, is the degree of presence of equal-pitch periods within the speech segment, and the second variable is the consistency level of different speech characteristics, which is a measurement of the ordering of the speech pattern.

Constant pitch presence: As a general rule, it is suggested that emotional speech is characterized by a lower presence of equal-pitch periods and a higher presence of changing (rising or falling) pitch periods, meaning that emotional speech displays a smaller number per second and a shorter average length of equal-pitch periods within the speech segment as compared to regular, non-emotional speech. It should be noted that we do not suggest that emotional speech will always be characterized by a higher pitch variation/range or by a higher frequency of pitch direction changes (rising/falling) within the speech segment, since the latter variables are more affected by specific emotional states, by individual differences and by speech loudness. In contradistinction, we suggest that the constant pitch presence parameters are less affected by the aforementioned intervening factors than are the higher pitch variation/range/frequency of changes parameters. Consequently, they are strongly indicative of emotional arousal.

Consistency level of different speech characteristics: As mentioned, it is suggested that irregularity in speech patterns relates to emotional expression.

The general, less-ordered behavior of speech characteristics is evident through a higher inconsistency of several speech variables, such as the length and dispersion of intervals between sequential pauses and silences, the length of the pauses and the silences themselves and the length, frequency and dispersion of different types of non-silent segments (e.g. length of rising and falling pitch periods). Similarly to the measurement of equal-pitch presence, emphasis is put on measuring events on the time scale: number per second, lengths, intervals and dispersion of specific speech variables or grouped periods within the speech segment. These time-based variables are generally less affected than pitch and amplitude variables by intervening and biasing factors. Detecting a combination of deviations in some of these variables from an ordered speech structure can reveal the irregularity in speech patterns that relates to emotional arousal.
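One simple, illustrative way to quantify this kind of inconsistency is a coefficient-of-variation score over the time-based variables, as in the Python sketch below. The patent does not prescribe this particular formula; it is only an assumed stand-in for a dispersion measure applied to event lengths and the intervals between events.

```python
import statistics

def irregularity_score(event_times, event_lengths):
    """Coefficient of variation of event lengths and of the intervals between
    consecutive events, averaged. Higher values mean a less ordered pattern."""
    intervals = [b - a for a, b in zip(event_times, event_times[1:])]
    scores = []
    for values in (event_lengths, intervals):
        if len(values) > 1 and statistics.mean(values) > 0:
            scores.append(statistics.pstdev(values) / statistics.mean(values))
    return statistics.mean(scores) if scores else 0.0
```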

2. Overcoming the Effect of Individual Speech Patterns

As stated hereinabove, patterns of voice and speech vary significantly from one person to another. Some of these differences are of a general nature. For example, statistically, women's speech has a higher pitch than men's speech. Other differences are more specific. For example, the individual's speech has a typical pitch characteristic of that individual, and there are many other speech tendencies that characterize the speech of particular individuals, such as monotonous speech, paused speech, etc.

In the embodiments of the present invention, to overcome the biasing effects due to the general characteristics of the individual's speech patterns, the determination of the general emotional arousal level of the subject makes selective use of secondary voice pitch parameters, and selective use of secondary voice amplitude parameters.

Use of secondary rather than primary speech parameters: Speech analysis in accordance with embodiments of the present invention uses mainly secondary voice and speech parameters and disregards primary parameters.

For the purposes of this application, the term secondary voice parameters refers to parameters derived from the primary pitch and amplitude parameters, and not the primary parameters themselves. Primary parameters are greatly affected by the differences between individuals, and hence are not considered, or at least are not given weighty consideration, in analyses performed in accordance with the present invention. Thus the voice frequency value, or pitch, itself is generally not used as a parameter, since it varies significantly between different people. However, pitch changes within the speech segments are emphasized, since these contribute relative, rather than absolute, values and are, therefore, less affected by the differences between individuals.

Selective use of secondary voice pitch parameters: Secondary voice parameters are also sensitive, to a degree, to the differences between speech patterns of different individuals. The speech processing of the present invention ignores most of the secondary parameters most affected by these differences.

An example of a secondary voice pitch parameter not used is the range of pitch change. This is considered to be a secondary parameter, since it represents only the relative changes of the speaker's pitch, and not the pitch itself. However, since this parameter correlates strongly to the actual pitch value, it is often markedly affected by the differences between individuals, and not only by the state of emotional arousal per se. Consequently, speech processing in accordance with the present invention typically ignores this parameter, and, likewise, other secondary parameters that vary significantly with the individual.

Selective use of secondary voice amplitude parameters: Many voice amplitude parameters, both primary and secondary, are more affected by speech differences between individuals than are pitch parameters. Amplitude parameters are also very susceptible to the general quality of the voice signal analyzed, which is adversely affected by environmental interference, such as audio noise, and by electronic noise associated with the various components of the analysis equipment. Consequently, determining the existence of emotional arousal in accordance with the present invention puts little emphasis on amplitude parameters, both primary and secondary.

3. Overcoming the Effects of Intensity

Although the magnitude of the emotional arousal of a subject is sometimes indicated by the magnitude (volume) of the speech itself, this is not always the case. For example, when a person shouts in anger, usually his voice pitch, voice amplitude and speech speed increase, causing a corresponding increase in many secondary speech parameters as well; however, the speech profile of one shouting in anger may be very different from the speech profile of one displaying a less excited form of anger, although both represent emotional arousal. There are some people who demonstrate anger by talking quietly and deliberately, for example.

The present invention is focused on the detection of emotional arousal per se, and not only intense emotional arousal, or emotional arousal corresponding to any particular emotion. Moreover, since differences in speech volume that are not related to emotional arousal may affect speech characteristics in a biasing way, for instance by influencing the volatility level of certain speech parameters, it is important to minimize, as much as possible, the effects of speech volume on speech processing. This may be accomplished by following the same guidelines as those detailed above in regard to the overcoming of the effects of individual speech patterns, including the selective use of mainly secondary pitch and amplitude parameters. Still, in order to reduce the sensitivity of the processing to the effect of voice magnitude even further, additional processing is preferably performed. The main influence that the audible volume of speech has on the speech is on increasing or decreasing the ranges of its parameters. Consequently, the speech processing of the present invention generally makes an initial classification of each processed speech segment in accordance with one of several typical parameter range behavior classes. This initial classification enables the processing to use different criteria for determining the existence of emotional arousal in different parameter range classes.
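The range classification itself can be as simple as binning a segment by its mean amplitude, as in the hypothetical sketch below; the class names, the boundaries and the per-class reference ranges are illustrative assumptions only, not values specified in the patent.

```python
def range_class(mean_rms, bounds=(0.05, 0.2)):
    """Bin a speech segment into a coarse parameter-range class by its mean
    amplitude; the class names and boundaries are illustrative only."""
    if mean_rms < bounds[0]:
        return "low_range"
    if mean_rms < bounds[1]:
        return "mid_range"
    return "high_range"

# Hypothetical non-emotional reference ranges, one set per range class.
REFERENCE_BY_CLASS = {
    "low_range":  {"equal_per_sec": (1.5, 4.0), "pause_len_avg": (0.30, 1.00)},
    "mid_range":  {"equal_per_sec": (2.0, 5.0), "pause_len_avg": (0.25, 0.90)},
    "high_range": {"equal_per_sec": (2.5, 6.0), "pause_len_avg": (0.20, 0.80)},
}
```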

4. Determining the Existence of Emotional Arousal

As mentioned hereinabove, after minimizing the different biasing effects, the speech characteristics that are associated most directly with emotional arousal have been found to be the degree of constant-pitch presence, the irregularity of pace, rhythm and other speech pattern indicators.

More specifically, the algorithm of the present invention uses a combination of at least two, and preferably more, of the following speech parameter categories:

- Pause length average and/or pause frequency
- Short silences length average and/or short silences frequency
- Equal pitch segments length average and/or equal pitch segments frequency
- Rising pitch segments length average and/or rising pitch segments frequency and/or falling pitch segments length average and/or falling pitch segments frequency
- Amplitude dispersion within equal pitch segments of speech

By ‘pauses’, relatively long silences in speech are intended. Pauses are typically about 0.25-1.5 second breaks in speech, usually appearing between sentences, for example.

By ‘short silences’, breaks having durations of less than about 0.25 seconds are intended. Short silences are the silences that typically appear between words and between syllables.

‘Equal pitch segments’ are continuous segments of speech that are characterized by having relatively stable pitch, that is, by the pitch varying between preset tolerances.

In contradistinction, ‘rising and falling pitch segments’ are segments characterized by a continuous and definite rising or falling trend of pitch.
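Using the duration boundaries just given (short silences below roughly 0.25 s, pauses roughly 0.25-1.5 s), the silent gaps detected during pre-processing can be split into the two classes as in this small sketch; the exact thresholds would be tuned in a real embodiment.

```python
def classify_gaps(gap_lengths, short_max=0.25, pause_max=1.5):
    """Split silent gaps (in seconds) into short silences (< ~0.25 s) and
    pauses (~0.25-1.5 s); longer gaps are treated as breaks between utterances."""
    shorts = [g for g in gap_lengths if g < short_max]
    pauses = [g for g in gap_lengths if short_max <= g <= pause_max]
    return shorts, pauses
```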

The determination of emotional arousal with a high degree of certainty requires that a combination of at least two (preferably more) of the above parameters simultaneously deviate from non-emotional values. However, preferably the decision as to whether the subject indeed displays an emotional arousal may also be made dependent on the degree of the deviation of each parameter, with ranges and values that characterize regularity for each parameter having been determined by analysis of large samples of speech data taken from the general population.
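A minimal sketch of such a decision rule follows: it counts how many secondary parameters fall outside their non-emotional reference ranges and flags arousal when at least two do. The reference ranges used in the example call are placeholders; per the text above, real ranges would be derived from large population samples and chosen per parameter-range class.

```python
def emotional_arousal(params, reference, min_deviations=2):
    """Flag arousal when at least 'min_deviations' secondary parameters fall
    outside their non-emotional reference ranges (low, high)."""
    deviations = {}
    for name, value in params.items():
        low, high = reference.get(name, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            width = (high - low) or 1.0
            # degree of deviation, relative to the width of the allowed range
            deviations[name] = min(abs(value - low), abs(value - high)) / width
    return len(deviations) >= min_deviations, deviations

# Example call with made-up numbers and assumed reference ranges:
# emotional_arousal({"equal_per_sec": 1.1, "pause_len_avg": 0.15},
#                   {"equal_per_sec": (2.0, 5.0), "pause_len_avg": (0.25, 0.9)})
```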

Referring now to FIG. 1, there is shown a block diagram illustration of an apparatus for detecting emotional arousal constructed and operative in accordance with one embodiment of the present invention. The apparatus includes a voice input unit 10, a voice sampling and digitizing unit 12, a pre-processing unit 14, a main processing unit 16, and a main indicators output unit 18. Voice input unit 10 can be any device that carries human voice data in any form: microphone, wired telephone, wireless or cellular telephone, any audio-player device (such as tape-recorder or compact-disc player), digitized voice files, Internet connection (voice over IP, cable, satellite or any other method). Voice sampling and digitizing unit 12 can be a computer sound card, a specific DSP chip or any other sampling and digitizing device.

The emotional arousal determination procedure, according to the present invention, is as follows (with some variations between different embodiments of the apparatus). The flow chart in FIG. 2 details the pre-processing stage and the flow chart in FIG. 3 details the main processing stage.

(a) Pre-processing: The pre-processing function serves to prepare the raw data for the processing itself. More specifically, it serves to obtain pitch and amplitude parameters per each speech block of a predefined length. The processor is a CPU unit, which may be the CPU of a PC, or may be a specific, dedicated DSP chip or indeed any other suitable processing device. The pre-processing includes the following processing steps, which are widely recognized by those familiar with the art of signal processing (FIG. 2):

- Obtaining digitized voice samples (block 20).
- Separation of group speech into individual voice channels' samples when required. For example, when the voice input is a telephone conversation, it is preferably divided into two voice channels, each representing a different speaker, possibly by separate sampling with one signal being obtained via the mouthpiece of one of the telephones, for example (block 22). Obviously, the pauses and the length of phrases in dialogue are significantly different from those in monologue, and these differences are appreciated and allowed for.
- Normalization of the samples' values, performed for both channels (block 24).
- Data filtering, performed for both channels (block 26).
- Noise-reduction, performed for both channels (block 28).
- Initiation of segmentation and basic parameters calculation for the first channel (block 30).
- Silence and speech segmentation and dividing the speech segments into blocks (block 32), performed for the first channel.
- Auto-correlation (block 34) to calculate pitch and amplitude, performed for the first channel.
- When there are two speakers, the segmentation and auto-correlation steps (blocks 30, 32, 34 above) are then performed for the second voice channel, if present (blocks 36 and 38).

The outputs of the pre-processing steps are strings of speech segment blocks characterized by having pitch and amplitude values per block, and lengths for silence and pause segments.
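For illustration, the hand-off between the two stages could look like the sketch below, which groups the per-block results from the earlier pre-processing sketch into alternating silence and speech segments, each carrying its length in seconds; the silence threshold is again an assumed value.

```python
def segment_silence_and_speech(blocks, block_sec, silence_rms=0.01):
    """Group consecutive pre-processed blocks into alternating silence and
    speech segments, each carrying its length in seconds."""
    segments = []
    for b in blocks:
        kind = "silence" if b["rms"] < silence_rms else "speech"
        if segments and segments[-1]["kind"] == kind:
            segments[-1]["blocks"].append(b)
        else:
            segments.append({"kind": kind, "blocks": [b]})
    for seg in segments:
        seg["length"] = len(seg["blocks"]) * block_sec
    return segments
```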

(b) Processing: The main processing procedure provides an indication of emotional arousal. It may be performed on the same CPU processor where the pre-processing was performed, or alternatively, on a different CPU unit. The processing unit may be the CPU of a PC, a specific DSP chip or any other suitable processing device. The processing procedure includes the following processing steps, for each channel (FIG. 3):

- Selecting a short speech segment, typically 3-6 seconds of speech, for processing (block 40).
- Calculating pause-related parameters of the speech segment, including the average number of pauses per second and the average pause length (block 42).
- Calculating silence-related parameters of the speech segment, including the average number of silences per second and the average silence length (block 43).
- Determining which segment strings of blocks are segment strings having equal pitch blocks, by marking the consecutive blocks having relatively constant pitch (that is, within acceptable tolerances) (block 44).
- Determining which segment strings of blocks display rising or falling pitch trends (block 46).
- Calculating the secondary pitch parameters of the speech segment, such as the average number per second and the average length of rising, falling and equal pitch periods and the amplitude dispersion of equal pitch periods (block 47).
- Classifying the processed speech segment into one of several categories of typical parameter ranges, in order to differentiate segments with different speech magnitudes (block 48).
- Determining the emotional arousal indication of the speech segment. This indicator is based on comparison of the calculated voice parameters with pre-defined values representing non-emotional speech, and scoring the combination of irregularities (block 50).

If a second channel is present, i.e., when taking specimens from two participants in a conversation (blocks 52 and 54), the same processing steps 40 to 50 are carried out on the sample from the second channel.

Preferably, the method also includes calculating a reliability grade, based on a combination of several factors, typically including the quality of the voice segment (noise level, size of sampled data, quality of sampled data), the significance of the emotional arousal decision (number of participating parameters, degree of deviation of each parameter), and the consistency of the specific segment results with the previous speech segments (emotional changes should follow reasonable patterns regarding the number of changes, their intensity, their length and switching between emotions in a given period of time).

Pattern processing: The processing may include another layer, which detects certain emotional patterns that are revealed with the passage of time, or when compared to other results. For example, when analyzing a conversation, comparing the emotional states of the two speakers enables the detection of patterns in the interpersonal communication, such as attachment, detachment, politeness, conversation atmosphere and progress.

(c) Output (FIG. 1, block 18): The emotion measurement results can be sent to various outputs in accordance with the specific apparatus configuration used, and in accordance with the specific application. Normally, the output will be sent to a user's real time display (visual, vocal, or textual). It may be reported to a remote user through any kind of networking and it may be logged or stored to any sort of output or storage device or file.

5. Apparatuses and Possible Applications

By way of example, two basic apparatuses are presented for the patent implementation, although any other suitable apparatus can alternatively be employed:

(a) A small, mobile, DSP chip based unit. This apparatus can serve as a small mobile unit for emotional arousal detection in real-time or offline analysis. It can be used as a stand-alone device in interpersonal face-to-face interactions. Alternatively, it can be connected to input or output devices such as a computer, audio player, wired or wireless or cellular telephone, electronic personal organizer, the Internet or any other network, in order to obtain various local or remote voice inputs and to display or report to various local or remote outputs. It can also be integrated as hardware into other devices, such as wired, wireless or cellular telephones, computer games, toys, computers or any other electronic equipment. The apparatus includes a microphone (or any other input interface), digital sampler, processor and display (or any other output interface).

(b) A software-based tool. This apparatus can serve as a computer-based tool for emotion arousal detection in real-time or offline analysis. It can be used as a stand-alone software tool for analysis of digitized voice files. Alternatively, it can be connected through the computer interfaces to any input/output device, in order to obtain any local or remote voice input, and display or report to various local or remote outputs, such as microphones, audio players, wired or wireless or cellular telephones, the Internet or any other network, other computers or any other electronic equipment. The software tool can also be integrated as a subsystem into another system. Such systems include call/contact center software, for example, or hardware which monitors, records or analyzes conversations, various situation and personal trainers or any monitoring, teaching or feedback system. The emotion software tool will typically be installed into a computer environment that includes a microphone (or any other input interface), sampling and digitizing unit, processor, display (or any other output interface) and any other relevant external interface.

It will be appreciated that the present invention has a very wide range of possible applications and implementations. A few of the possibilities are listed below, by way of example only. However, the use of the present invention is not limited to those applications described herein.

Emotion monitoring can be used to improve marketing, sales, service and relations with customers, especially in the call center environment. The emotion monitoring, feedback and supervision of the service/sales interactions can be implemented in a real time environment, as well as in off line analysis. The monitoring can be implemented with both apparatuses described above: It can be integrated as a software tool into other call center products, such as recording tools, CRM (customer relation management) products, training tools or E-commerce software. It can be installed as a stand-alone software tool in the call center, CRM or E-commerce environments and it can also be integrated into various hardware devices in these environments as a DSP chip based unit. A small DSP chip based unit can also be used as an independent small unit for monitoring face-to-face agent-customer interactions.

Emotion monitoring can be used to improve the training process of various professional personnel by improving awareness of emotional, as well as non-emotional, verbal patterns, as expressed in a speaker's voice. In addition, the monitoring tool can be used for demonstration purposes (analyzing speech segments of different emotions and different emotion expression patterns) and for training in controlling emotion expression (feedback of user's emotions plus reward for altering emotion or expression pattern).

Emotional monitoring can be used as an assisting tool in various interpersonal managerial tasks, such as interviewing or negotiating, in meetings, or even when simply speaking on the telephone.

Monitoring emotion may be useful as an additional tool for psychological testing, and for diagnosis and treatment of specific illnesses, including psychiatric illnesses, for example. This monitoring can be conducted during real time conversations, or in off line analysis of recorded conversation, and it can be operated in face to face interactions, or when interaction occurs via the telephone or in vocal telecommunication over the Internet.

Advertising can also benefit from emotional monitoring, by adding significant value to the process of measuring and evaluating people's attitudes in verbal questionnaires, focus groups, and other methods.

Emotional monitoring can be used to aid in speech therapy, to increase relaxation and to achieve more control over positive and negative emotional states. Altering the emotional state can be achieved either as a direct result of the increased awareness, or through a procedure similar to a biofeedback mechanism. One important application may be to assist the many programs aimed at reducing violent behavior among children and adults, where the monitoring can help to demonstrate and to alter patterns of verbal anger.

The use of emotion monitoring can provide an added quality to computer and electronic games, both educational and recreational games. Emotion monitoring can also be part of toys and games that interact with a child and reflect to him his emotional state.

Emotion monitoring in accordance with the present invention can also be used to improve speech recognition in various applications, and to enhance the interaction between a computer or robot and its user, by permitting the electronic device to respond to the emotional state of people around it.

Emotion monitoring can even be used as a tool for detecting some mental states which have distinctive voice characteristics, such as fatigue.

It will be appreciated that the invention is not limited to what has been described hereinabove merely by way of example. Rather, the invention is limited solely by the claims which follow.

1. A method for determining emotional arousal of a subject by speech analysis, comprising the steps of: obtaining a speech sample; pre-processing the speech sample into silent and active speech segments and dividing the active speech segments into strings of equal length blocks; said blocks having primary speech parameters including pitch and amplitude parameters; deriving a plurality of selected secondary speech parameters indicative of characteristics of equal-pitch, rising-pitch and falling-pitch trends in said strings of blocks; comparing said secondary speech parameters with predefined, subject independent values representing non-emotional speech to generate a processing result indicative of emotional arousal, and outputting said generated processed result to an output device, wherein said secondary speech parameters comprise: (a) average length of short silences and number of short silences per unit of time; (b) average length of equal pitch segments and number of equal pitch segments per unit of time; (c) rising pitch segments length average and number of rising pitch segments per unit of time and falling pitch segments length average and number of falling pitch segments per unit of time; and (d) average amplitude dispersion within equal pitch segments of speech.
2. The method according to claim 1, wherein said step of deriving further includes deriving a plurality of selected secondary speech parameters indicative of pause and silence characteristics of the speech sample being analyzed.
3. The method according to claim 1, including analyzing irregularity of pace and rhythm, pitch, and amplitude of the speech sample being analyzed.
4. The method according to claim 1, wherein said step of obtaining a sample of speech comprises the step of inputting a digitized voice file.
5. The method according to claim 1, wherein said step of obtaining a sample of speech comprises the step of capturing speech specimens and sampling and digitizing the speech specimens in a voice sampling and digitizing unit to form a digitized voice file.
6. The method according to claim 1, wherein the step of pre-processing includes: obtaining digitized voice samples, normalizing said voice samples, reducing noise, segmenting said voice samples into silence and speech segments, dividing the speech segments into blocks, and processing said blocks by auto-correlation, to calculate pitch and amplitude speech parameters per block.
7. The method according to claim 1, adapted for analyzing a speech signal including a plurality of interacting voices, further comprising: separating the interacting voices into separate voice channels; performing samples normalization for each channel of interest; performing noise-reduction for each channel of interest; performing silence and speech segmentation and dividing the speech segments into blocks for each channel of interest, and performing auto-correlation processing to calculate pitch and amplitude speech parameters per block for each channel of interest.
8. The method according to claim 1, wherein the step of deriving includes: marking a speech segment of a pre-defined length for processing; calculating pauses related parameters for said speech segment; calculating silences related parameters for said speech segment; joining blocks into strings of blocks categorized as being strings of blocks having rising pitch trends, strings of blocks having falling pitch trends and strings of blocks having equal pitch trends; calculating pitch related parameters within the speech segment, said pitch related parameters comprising average length of equal pitch segments, number of equal pitch segments per unit of time, rising pitch segments length average, number of rising pitch segments per unit of time, falling pitch segments length average, number of falling pitch segments per unit of time and average amplitude dispersion within equal pitch segments of speech, and classifying the speech segment into one of several categories of typical parameter range.
9. The method according to claim 1, further comprising calculating a reliability grade based on quality of voice segment, significance of emotional arousal decision, and consistency of specific segment results with results of previous speech segments.
10. The method according to claim 9, wherein said quality of voice segment is determined based on noise level, size of sampled data, and quality of sampled data.
11. The method according to claim 9, wherein said significance of emotional arousal decision is determined based on number of participating parameters and degree of deviation within each parameter.
12. The method according to claim 1, further comprising pattern processing to detect emotional patterns that are revealed along a time axis.
13. The method according to claim 1, wherein: said obtaining step is carried out using a voice input unit; said pre-processing is carried out using a pre-processing unit for pre-processing voice samples from the voice input unit; said deriving and comparing steps are carried out using a main processing unit for processing said pre-processed voice samples and detecting emotional arousal therefrom; and said processing result is outputted to a main indicators output unit for outputting an indication of emotional arousal.
14. The method according to claim 13, wherein said voice input unit of the apparatus includes a voice capturing unit and a voice sampling and digitizing unit coupled to said voice capturing unit for sampling and digitizing captured voice input.
15. The method according to claim 13, wherein said voice input unit of the apparatus includes at least one element selected from the group consisting of: a microphone, an interface to an audio player, an interface to a wired, wireless or cellular telephone, an interface to the Internet or other network, an interface to a computer, an interface to an electronic personal organizer or to any other electronic equipment, and an interface to a toy.
16. The method according to claim 14, wherein said voice sampling and digitizing unit of the apparatus is selected from the group consisting of a sound card, and a DSP chip-based voice sampling and digitizing device.
17. The method according to claim 13, wherein said main indicators output unit of the apparatus is selected from the group consisting of a local output device, a display, a speaker, a file, a storage unit, a monitoring device, and an interface to a remote computer, to the Internet, to another network, to a wired, wireless or cellular telephone, to a computer game, to a toy, to an electronic personal organizer or to any other electronic output equipment.
18. The method according to claim 13, wherein all said units of the apparatus are installed on a small, mobile, DSP chip based unit.
19. The method according to claim 13, wherein some of said units of the apparatus are physically distanced from other units, and said apparatus further comprises an interface for allowing data communication between said units.
20. The method according to claim 13, wherein said pre-processing and said processing units of the apparatus incorporate a software tool capable of integrating with an external source of digitized voice input and with an external output device.